Outline 0 Introduction In this paper…… An overview of the purpose of your analysis including details or references necessary.
1 Overview 1.1 Business Problem 1.2 Business Value Proposition
2 Data Description ◦the source ◦the variables with definitions or a link to a codebook ◦the number of observations in your data set ◦detail on missingness ◦a glimpse of your data if possible (e.g. head and tail)
3 Data Preprocessing ◦Feature generation ◦Imputation ◦Cleaning or merging of categories ◦Outlier removal ◦Anything that changes your data from the original form
4.Your Final Analysis in small pieces with annotation 5.Graphs to visualize different steps in your analysis 6.Clear discussion of why you made analysis choices 7.References to papers or citations you used to make decisions about the analysis
This article aims to provide a reproducible analytics workflow solving a hypothetical business problem using a predictive regression model on the UCI Wine data. A hypothetical business value proposition along with the full analytics solution will be provided. The important processes will also be documented.
The datasets used is obtained from UCI Machine Learning Repository. The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: https://www.vinhoverde.pt/en/
The business background of the project is to provide an analytics solution to the sales team in assisting with predicting wine quality / expert rating using only its chemical feature. The sales team would like to estimate on the wine quality before being in an expert so that they can potentially reduce cost by only have the expert rate the wine that’s estimated to have higher quality.
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal, 2009
Download red wine dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Download white wine dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
The datasets include 12 variables:
Input variables:
Output Variable:
For more information, read from this file: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names
There are 12 columns available in both datasets. We can see that both datasets have the same attributes with the same data type, therefore it is reasonable for them to be combined.
wineRed = read.csv("winequality-red.csv",sep = ";")
wineWhite = read.csv("winequality-white.csv",sep = ";")
head(wineRed)
str(wineRed)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
head(wineWhite)
str(wineWhite)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Prepare the data by checking missing data, merging datasets, and creating functions necessary for later analysis.
Metadata<-function(df){
library(DataExplorer)
library(kableExtra)
z<-introduce(df)
z<-as.data.frame(t(z))
colnames(z)<-c()
knitr::kable(
z,
caption="Data Introduction"
) %>% kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F,
font_size = 12,
position = "left")
}
Based on the output below, we can see that there are no missing data present in both red and white wine datasets.
library(tidyverse)
wineRed %>% is.na() %>% colSums()
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
wineWhite %>% is.na() %>% colSums()
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
The different wine types are transformed into dummies and stored in a categorical variable named wine.type.
# Create Dummies
wineRed$wine.type <- 1
wineWhite$wine.type <- 0
# Check Dimensions
dim(wineRed)
## [1] 1599 13
dim(wineWhite)
## [1] 4898 13
# Combine dataframes
wine = rbind(wineRed, wineWhite)
# Recode attribute as factor
wine$wine.type <- as.factor(wine$wine.type)
# Double check dimension
dim(wine)
## [1] 6497 13
| rows | 1599 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 1599 |
| total_observations | 20787 |
| memory_usage | 163576 |
| rows | 4898 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 4898 |
| total_observations | 63674 |
| memory_usage | 493472 |
| rows | 6497 |
| columns | 13 |
| discrete_columns | 1 |
| continuous_columns | 12 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 6497 |
| total_observations | 84461 |
| memory_usage | 627904 |
To explore the dataset, we can take a look at a scatterplot for the response. We will use histograms for numerical variables, bar chart for categorical variables. Some other summarized statistics were also shown down below.
From the scatterplot, we cannot see any obvious trend between any variables and response.
From the histogram, we can see that most numeric variables are right skewed and unimodel which makes sense as these variables can’t have negative values. PH and quality are both normally distributed. Total sulfur dioxide seems to be bimodel.
The barplot shows the distribution of red and white wine with more white wine in the data.
head(wine)
## Scatterplot `quality` with all other continuous features
plot_scatterplot(split_columns(wine)$continuous, by = "quality", sampled_rows = 1000L)
library(DataExplorer)
plot_histogram(wine)
library(DataExplorer)
plot_bar(wine)
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality wine.type
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000 0:4898
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000 1:1599
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.49 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :9.000
str(wine)
## 'data.frame': 6497 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ wine.type : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
library(DataExplorer)
plot_correlation(wine, type = "c")
Based on the correlation plot above, some of the predictors have strong correlations:
These strongly correlated variable could potentially lead to multicolinearity.
set.seed(13)
trainIndex = sample(1:nrow(wine), size = round(0.75*nrow(wine)), replace=FALSE)
train<-wine[trainIndex, ]
valid<-wine[-trainIndex, ]
nrow(train)
## [1] 4873
nrow(valid)
## [1] 1624
library(MASS)
# Fit the full model
full <- lm(quality ~., data = train)
# Stepwise regression model
step <- stepAIC(full, direction = "both", trace = FALSE)
summary(step)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates + alcohol + wine.type, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5440 -0.4690 -0.0480 0.4567 3.0005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.178e+02 1.798e+01 6.553 6.23e-11 ***
## fixed.acidity 1.027e-01 1.895e-02 5.419 6.28e-08 ***
## volatile.acidity -1.530e+00 8.799e-02 -17.392 < 2e-16 ***
## residual.sugar 6.846e-02 7.206e-03 9.501 < 2e-16 ***
## free.sulfur.dioxide 3.898e-03 8.880e-04 4.390 1.16e-05 ***
## total.sulfur.dioxide -1.261e-03 3.756e-04 -3.356 0.000796 ***
## density -1.179e+02 1.825e+01 -6.457 1.17e-10 ***
## pH 6.013e-01 1.066e-01 5.642 1.78e-08 ***
## sulphates 6.918e-01 8.790e-02 7.871 4.32e-15 ***
## alcohol 2.169e-01 2.290e-02 9.469 < 2e-16 ***
## wine.type1 3.660e-01 6.754e-02 5.418 6.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7342 on 4862 degrees of freedom
## Multiple R-squared: 0.2991, Adjusted R-squared: 0.2977
## F-statistic: 207.5 on 10 and 4862 DF, p-value: < 2.2e-16
p.valid<-predict(step, newdata=valid)
head(p.valid)
## 5 7 13 15 17 20
## 4.928785 5.066270 5.376766 5.228540 6.034067 5.690349
library(caret)
RMSE(p.valid, valid$quality)
## [1] 0.7319179
R2(p.valid, valid$quality)
## [1] 0.2857878
pressure
paste("The mean pressure is:", round(mean(pressure$pressure),3), "mm")
[1] “The mean pressure is: 124.337 mm”
library(knitr)
kable(head(pressure), format="pipe", digit=3)
| temperature | pressure |
|---|---|
| 0 | 0.000 |
| 20 | 0.001 |
| 40 | 0.006 |
| 60 | 0.030 |
| 80 | 0.090 |
| 100 | 0.270 |
kable(tail(pressure), format="pipe", digit=3)
| temperature | pressure | |
|---|---|---|
| 14 | 260 | 96 |
| 15 | 280 | 157 |
| 16 | 300 | 247 |
| 17 | 320 | 376 |
| 18 | 340 | 558 |
| 19 | 360 | 806 |
You can also embed plots, for example:
For more details on organizing with tabset go here https://bookdown.org/yihui/rmarkdown-cookbook/html-tabs.html.
You can include both inline and offset equations.
You can include inline equations like \(y = nx + b\), you can also do more complicated inline equations such as \(\hat{y} = \hat{\beta} + \hat{\beta_1}x\)